Creating TCGA cohorts (part 1)

This notebook will show you how to create a TCGA cohort using the publicly available TCGA BigQuery tables that the ISB-CGC project has produced based on the open-access TCGA data available at the Data Portal. You will need to have access to a Google Cloud Platform (GCP) project in order to use BigQuery. If you don't already have one, you can sign up for a free-trial or contact us and become part of the community evaluation phase of our Cancer Genomics Cloud pilot.

We are not attempting to provide a thorough BigQuery or IPython tutorial here, as a wealth of such information already exists. Here are some links to some resources that you might find useful:

  • BigQuery,
  • the BigQuery web UI where you can run queries interactively,
  • IPython (now known as Jupyter), and
  • Cloud Datalab the recently announced interactive cloud-based platform that this notebook is being developed on.

There are also many tutorials and samples available on github (see, in particular, the datalab repo and the Google Genomics project).

OK then, let's get started! In order to work with BigQuery, the first thing you need to do is import the bigquery module:


In [1]:
import gcp.bigquery as bq

The next thing you need to know is how to access the specific tables you are interested in. BigQuery tables are organized into datasets, and datasets are owned by a specific GCP project. The tables we will be working with in this notebook are in a dataset called tcga_201607_beta, owned by the isb-cgc project. A full table identifier is of the form <project_id>:<dataset_id>.<table_id>. Let's start by getting some basic information about the tables in this dataset:


In [2]:
d = bq.DataSet('isb-cgc:tcga_201607_beta')
for t in d.tables():
  print '%10d rows  %12d bytes   %s' \
      % (t.metadata.rows, t.metadata.size, t.name.table_id)


      6322 rows       1729204 bytes   Annotations
     23797 rows       6382147 bytes   Biospecimen_data
     11160 rows       4201379 bytes   Clinical_data
   2646095 rows     333774244 bytes   Copy_Number_segments
3944304319 rows  445303830985 bytes   DNA_Methylation_betas
 382335670 rows   43164264006 bytes   DNA_Methylation_chr1
 197519895 rows   22301345198 bytes   DNA_Methylation_chr10
 235823572 rows   26623975945 bytes   DNA_Methylation_chr11
 198050739 rows   22359642619 bytes   DNA_Methylation_chr12
  97301675 rows   10986815862 bytes   DNA_Methylation_chr13
 123239379 rows   13913712352 bytes   DNA_Methylation_chr14
 124566185 rows   14064712239 bytes   DNA_Methylation_chr15
 179772812 rows   20296128173 bytes   DNA_Methylation_chr16
 234003341 rows   26417830751 bytes   DNA_Methylation_chr17
  50216619 rows    5669139362 bytes   DNA_Methylation_chr18
 211386795 rows   23862583107 bytes   DNA_Methylation_chr19
 279668485 rows   31577200462 bytes   DNA_Methylation_chr2
  86858120 rows    9805923353 bytes   DNA_Methylation_chr20
  35410447 rows    3997986812 bytes   DNA_Methylation_chr21
  70676468 rows    7978947938 bytes   DNA_Methylation_chr22
 201119616 rows   22705358910 bytes   DNA_Methylation_chr3
 159148744 rows   17968482285 bytes   DNA_Methylation_chr4
 195864180 rows   22113162401 bytes   DNA_Methylation_chr5
 290275524 rows   32772371379 bytes   DNA_Methylation_chr6
 240010275 rows   27097948808 bytes   DNA_Methylation_chr7
 164810092 rows   18607886221 bytes   DNA_Methylation_chr8
  81260723 rows    9173717922 bytes   DNA_Methylation_chr9
  98082681 rows   11072059468 bytes   DNA_Methylation_chrX
   2330426 rows     263109775 bytes   DNA_Methylation_chrY
   1867233 rows     207365611 bytes   Protein_RPPA_data
   5356089 rows    5715538107 bytes   Somatic_Mutation_calls
   5738048 rows     657855993 bytes   mRNA_BCGSC_GA_RPKM
  38299138 rows    4459086535 bytes   mRNA_BCGSC_HiSeq_RPKM
  44037186 rows    5116942528 bytes   mRNA_BCGSC_RPKM
  16794358 rows    1934755686 bytes   mRNA_UNC_GA_RSEM
 211284521 rows   24942992190 bytes   mRNA_UNC_HiSeq_RSEM
 228078879 rows   26877747876 bytes   mRNA_UNC_RSEM
  11997545 rows    2000881026 bytes   miRNA_BCGSC_GA_isoform
   4503046 rows     527101917 bytes   miRNA_BCGSC_GA_mirna
  90237323 rows   15289326462 bytes   miRNA_BCGSC_HiSeq_isoform
  28207741 rows    3381212265 bytes   miRNA_BCGSC_HiSeq_mirna
 102234868 rows   17290207488 bytes   miRNA_BCGSC_isoform
  32710787 rows    3908314182 bytes   miRNA_BCGSC_mirna
  26763022 rows    3265303352 bytes   miRNA_Expression

In this tutorial, we are going to look at a few different ways that we can use the information in these tables to create cohorts. Now, you maybe asking what we mean by "cohort" and why you might be interested in creating one, or maybe what it even means to "create" a cohort. The TCGA dataset includes clinical, biospecimen, and molecular data from over 10,000 cancer patients who agreed to be a part of this landmark research project to build The Cancer Genome Atlas. This large dataset was originally organized and studied according to cancer type but now that this multi-year project is nearing completion, with over 30 types of cancer and over 10,000 tumors analyzed, you have the opportunity to look at this dataset from whichever angle most interests you. Maybe you are particularly interested in early-onset cancers, or gastro-intestinal cancers, or a specific type of genetic mutation. This is where the idea of a "cohort" comes in. The original TCGA "cohorts" were based on cancer type (aka "study"), but now you can define a cohort based on virtually any clinical or molecular feature by querying these BigQuery tables. A cohort is simply a list of samples, using the TCGA barcode system. Once you have created a cohort you can use it in any number of ways: you could further explore the data available for one cohort, or compare one cohort to another, for example.

In the rest of this tutorial, we will create several different cohorts based on different motivating research questions. We hope that these examples will provide you with a starting point from which you can build, to answer your own research questions.

Exploring the Clinical data table

Let's start by looking at the clinical data table. The TCGA dataset contains a few very basic clinical data elements for almost all patients, and contains additional information for some tumor types only. For example smoking history information is generally available only for lung cancer patients, and BMI (body mass index) is only available for tumor types where that is a known significant risk factor. Let's take a look at the clinical data table and see how many different pieces of information are available to us:


In [3]:
%bigquery schema --table isb-cgc:tcga_201607_beta.Clinical_data


Out[3]:

That's a lot of fields! We can also get at the schema programmatically:


In [4]:
table = bq.Table('isb-cgc:tcga_201607_beta.Clinical_data')
if ( table.exists() ):
    fieldNames = map(lambda tsf: tsf.name, table.schema)
    fieldTypes = map(lambda tsf: tsf.data_type, table.schema)
    print " This table has %d fields. " % ( len(fieldNames) )
    print " The first few field names and types are: " 
    print "     ", fieldNames[:5]
    print "     ", fieldTypes[:5]
else: 
    print " There is no existing table called %s:%s.%s" % ( table.name.project_id, table.name.dataset_id, table.name.table_id )


 This table has 70 fields. 
 The first few field names and types are: 
      [u'ParticipantBarcode', u'Study', u'Project', u'ParticipantUUID', u'TSSCode']
      [u'STRING', u'STRING', u'STRING', u'STRING', u'STRING']

Let's look at these fields and see which ones might be the most "interesting", by looking at how many times they are filled-in (not NULL), or how much variation exists in the values. If we wanted to look at just a single field, "tobacco_smoking_history" for example, we could use a very simple query to get a basic summary:


In [5]:
%%sql 

SELECT tobacco_smoking_history, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
GROUP BY tobacco_smoking_history
ORDER BY n DESC


Out[5]:
tobacco_smoking_historyn
 8161
1865
4799
2710
3568
557

(rows: 6, time: 1.3s, 8KB processed, job: job_Tp9_oGNMvJ7f0ZcTwdt2TirZePk)

But if we want to loop over all fields and get a sense of which fields might provide us with useful criteria for specifying a cohort, we'll want to automate that. We'll put a threshold on the minimum number of patients that we expect information for, and the maximum number of unique values (since fields such as the "ParticipantBarcode" will be unique for every patient and, although we will need that field later, it's probably not useful for defining a cohort).


In [15]:
numPatients = table.metadata.rows
print " The %s table describes a total of %d patients. " % ( table.name.table_id, numPatients )

# let's set a threshold for the minimum number of values that a field should have,
# and also the maximum number of unique values
minNumPatients = int(numPatients*0.80)
maxNumValues = 50

numInteresting = 0
iList = []
for iField in range(len(fieldNames)):
  aField = fieldNames[iField]
  aType = fieldTypes[iField]
  try:
    qString = "SELECT {0} FROM [{1}]".format(aField,table)
    query = bq.Query(qString)
    df = query.to_dataframe()
    summary = df[str(aField)].describe()
    if ( aType == "STRING" ):
      topFrac = float(summary['freq'])/float(summary['count'])
      if ( summary['count'] >= minNumPatients ):
        if ( summary['unique'] <= maxNumValues and summary['unique'] > 1 ):
          if ( topFrac < 0.90 ):
            numInteresting += 1
            iList += [aField]
            print "     > %s has %d values with %d unique (%s occurs %d times) " \
              % (str(aField), summary['count'], summary['unique'], summary['top'], summary['freq'])
    else:
      if ( summary['count'] >= minNumPatients ):
        if ( summary['std'] > 0.1 ):
          numInteresting += 1
          iList += [aField]
          print "     > %s has %d values (mean=%.0f, sigma=%.0f) " \
            % (str(aField), summary['count'], summary['mean'], summary['std'])
  except:
    pass

print " "
print " Found %d potentially interesting features: " % numInteresting
print "   ", iList


 The Clinical_data table describes a total of 11160 patients. 
     > Study has 11160 values with 33 unique (BRCA occurs 1097 times) 
     > age_at_initial_pathologic_diagnosis has 11109 values (mean=59, sigma=14) 
     > batch_number has 11160 values (mean=203, sigma=135) 
     > vital_status has 11156 values with 2 unique (Alive occurs 7534 times) 
     > days_to_birth has 11041 values (mean=-21763, sigma=5266) 
     > days_to_last_known_alive has 11102 values (mean=1037, sigma=1041) 
     > gender has 11160 values with 2 unique (FEMALE occurs 5815 times) 
     > year_of_initial_pathologic_diagnosis has 11030 values (mean=2008, sigma=4) 
     > person_neoplasm_cancer_status has 10236 values with 2 unique (TUMOR FREE occurs 6507 times) 
     > race has 9835 values with 5 unique (WHITE occurs 8186 times) 
 
 Found 10 potentially interesting features: 
    [u'Study', u'age_at_initial_pathologic_diagnosis', u'batch_number', u'vital_status', u'days_to_birth', u'days_to_last_known_alive', u'gender', u'year_of_initial_pathologic_diagnosis', u'person_neoplasm_cancer_status', u'race']

The above helps us narrow down on which fields are likely to be the most useful, but if you have a specific interest, for example in menopause or HPV status, you can still look at those in more detail very easily:


In [16]:
%%sql
SELECT menopause_status, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE menopause_status IS NOT NULL
GROUP BY menopause_status
ORDER BY n DESC


Out[16]:
menopause_statusn
Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)1291
Pre (<6 months since LMP AND no prior bilateral ovariectomy AND not on estrogen replacement)389
Peri (6-12 months since last menstrual period)82
Indeterminate (neither Pre or Postmenopausal)54

(rows: 4, time: 0.8s, cached, job: job_R6HF_RzYdXcntPKVIy06TNbDi8Q)

We might wonder which specific tumor types have menopause information:


In [17]:
%%sql
SELECT Study, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE menopause_status IS NOT NULL
GROUP BY Study
ORDER BY n DESC


Out[17]:
Studyn
BRCA1007
UCEC517
CESC237
UCS55

(rows: 4, time: 0.5s, cached, job: job_WIGSHA6zqYL1yChZi9jnwh-ZL-A)

In [18]:
%%sql
SELECT hpv_status, hpv_calls, COUNT(*) AS n
FROM [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE hpv_status IS NOT NULL
GROUP BY hpv_status, hpv_calls
HAVING n > 20
ORDER BY n DESC


Out[18]:
hpv_statushpv_callsn
Negative 664
PositiveHPV16238
PositiveHPV1841
PositiveHPV3325
PositiveHPV4524

(rows: 5, time: 0.6s, cached, job: job_4BS0EXKldZBN8Y2QcbPA--vK6_g)

TCGA Annotations

An additional factor to consider, when creating a cohort is that there may be additional information that might lead one to exclude a particular patient from a cohort. In certain instances, patients have been redacted or excluded from analyses for reasons such as prior treatment, etc, but since different researchers may have different criteria for using or excluding certain patients or certain samples from their analyses, in many cases the data is still available while at the same time "annotations" may have been entered into a searchable database. These annotations have also been uploaded into a BigQuery table and can be used in conjuction with the other BigQuery tables.

Early-onset Breast Cancer

Now that we have a better idea of what types of information is available in the Clinical data table, let's create a cohort consisting of female breast-cancer patients, diagnosed at the age of 50 or younger.

In this next code cell, we define several queries within a module which allows us to use them both individually and by reference in the final, main query.

  • the first query, called select_on_annotations, finds all patients in the Annotations table which have either been 'redacted' or had 'unacceptable prior treatment';
  • the second query, select_on_clinical selects all female breast-cancer patients who were diagnosed at age 50 or younger, while also pulling out a few additional fields that might be of interest; and
  • the final query joins these two together and returns just those patients that meet the clinical-criteria and do not meet the exclusion-criteria.

In [19]:
%%sql --module createCohort_and_checkAnnotations

DEFINE QUERY select_on_annotations
SELECT
  ParticipantBarcode,
  annotationCategoryName AS categoryName,
  annotationClassification AS classificationName
FROM
  [isb-cgc:tcga_201607_beta.Annotations]
WHERE
  ( itemTypeName="Patient"
    AND (annotationCategoryName="History of unacceptable prior treatment related to a prior/other malignancy"
      OR annotationClassification="Redaction" ) )
GROUP BY
  ParticipantBarcode,
  categoryName,
  classificationName

DEFINE QUERY select_on_clinical
SELECT
  ParticipantBarcode,
  vital_status,
  days_to_last_known_alive,
  ethnicity,
  histological_type,
  menopause_status,
  race
FROM
  [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE
  ( Study="BRCA"
    AND age_at_initial_pathologic_diagnosis<=50
    AND gender="FEMALE" )

SELECT
  c.ParticipantBarcode AS ParticipantBarcode
FROM (
  SELECT
    a.categoryName,
    a.classificationName,
    a.ParticipantBarcode,
    c.ParticipantBarcode,
  FROM ( $select_on_annotations ) AS a
  OUTER JOIN EACH 
       ( $select_on_clinical ) AS c
  ON
    a.ParticipantBarcode = c.ParticipantBarcode
  WHERE
    (a.ParticipantBarcode IS NOT NULL
      OR c.ParticipantBarcode IS NOT NULL)
  ORDER BY
    a.classificationName,
    a.categoryName,
    a.ParticipantBarcode,
    c.ParticipantBarcode )
WHERE
  ( a.categoryName IS NULL
    AND a.classificationName IS NULL
    AND c.ParticipantBarcode IS NOT NULL )
ORDER BY
  c.ParticipantBarcode

Here we explicitly call just the first query in the module, and we get a list of 212 patients with one of these disqualifying annotations:


In [20]:
bq.Query(createCohort_and_checkAnnotations.select_on_annotations).results().to_dataframe()


Out[20]:
ParticipantBarcode categoryName classificationName
0 TCGA-01-0629 Tumor tissue origin incorrect Redaction
1 TCGA-13-1479 Tumor tissue origin incorrect Redaction
2 TCGA-01-0638 Tumor tissue origin incorrect Redaction
3 TCGA-33-4579 Tumor tissue origin incorrect Redaction
4 TCGA-GN-A261 Tumor tissue origin incorrect Redaction
5 TCGA-66-2751 Genotype mismatch Redaction
6 TCGA-66-2752 Genotype mismatch Redaction
7 TCGA-66-2750 Genotype mismatch Redaction
8 TCGA-66-2746 Genotype mismatch Redaction
9 TCGA-66-2747 Genotype mismatch Redaction
10 TCGA-35-3621 Genotype mismatch Redaction
11 TCGA-02-0002 Genotype mismatch Redaction
12 TCGA-02-0117 Genotype mismatch Redaction
13 TCGA-08-0384 Genotype mismatch Redaction
14 TCGA-E2-A1IP Genotype mismatch Redaction
15 TCGA-14-0784 Genotype mismatch Redaction
16 TCGA-14-1036 Genotype mismatch Redaction
17 TCGA-06-0748 Genotype mismatch Redaction
18 TCGA-02-2488 Genotype mismatch Redaction
19 TCGA-14-1824 Genotype mismatch Redaction
20 TCGA-PN-A8M9 Genotype mismatch Redaction
21 TCGA-12-1601 Subject withdrew consent Redaction
22 TCGA-12-0653 Subject withdrew consent Redaction
23 TCGA-32-2498 Subject withdrew consent Redaction
24 TCGA-AF-3912 Subject withdrew consent Redaction
25 TCGA-A6-2670 Subject withdrew consent Redaction
26 TCGA-06-0131 Subject withdrew consent Redaction
27 TCGA-AN-A0FG Subject identity unknown Redaction
28 TCGA-AN-A0FE Subject identity unknown Redaction
29 TCGA-F4-6857 Subject identity unknown Redaction
... ... ... ...
182 TCGA-AP-A053 History of unacceptable prior treatment relate... Notification
183 TCGA-AX-A06D History of unacceptable prior treatment relate... Notification
184 TCGA-AX-A1CP History of unacceptable prior treatment relate... Notification
185 TCGA-AX-A1CR History of unacceptable prior treatment relate... Notification
186 TCGA-AX-A2H8 History of unacceptable prior treatment relate... Notification
187 TCGA-AX-A2HF History of unacceptable prior treatment relate... Notification
188 TCGA-AX-A3G3 History of unacceptable prior treatment relate... Notification
189 TCGA-B5-A0KB History of unacceptable prior treatment relate... Notification
190 TCGA-BG-A221 History of unacceptable prior treatment relate... Notification
191 TCGA-D1-A3JP History of unacceptable prior treatment relate... Notification
192 TCGA-EY-A1G8 History of unacceptable prior treatment relate... Notification
193 TCGA-L5-A88T History of unacceptable prior treatment relate... Notification
194 TCGA-WB-A820 History of unacceptable prior treatment relate... Notification
195 TCGA-XK-AAJ3 History of unacceptable prior treatment relate... Notification
196 TCGA-EJ-7312 History of unacceptable prior treatment relate... Notification
197 TCGA-96-A4JK History of unacceptable prior treatment relate... Notification
198 TCGA-2G-AAFE History of unacceptable prior treatment relate... Notification
199 TCGA-IC-A6RF History of unacceptable prior treatment relate... Notification
200 TCGA-BA-4075 History of unacceptable prior treatment relate... Notification
201 TCGA-06-6391 History of unacceptable prior treatment relate... Notification
202 TCGA-5L-AAT1 History of unacceptable prior treatment relate... Notification
203 TCGA-HT-A619 History of unacceptable prior treatment relate... Notification
204 TCGA-T1-A6J8 History of unacceptable prior treatment relate... Notification
205 TCGA-BG-A0M8 History of unacceptable prior treatment relate... Notification
206 TCGA-XK-AAK1 History of unacceptable prior treatment relate... Notification
207 TCGA-BH-A0B6 History of unacceptable prior treatment relate... Notification
208 TCGA-BG-A0MS History of unacceptable prior treatment relate... Notification
209 TCGA-AR-A2LR History of unacceptable prior treatment relate... Notification
210 TCGA-BH-A1F5 History of unacceptable prior treatment relate... Notification
211 TCGA-DM-A286 Inadvertently shipped Redaction

212 rows × 3 columns

and here we explicitly call just the second query, resulting in 329 patients:


In [21]:
bq.Query(createCohort_and_checkAnnotations.select_on_clinical).results().to_dataframe()


Out[21]:
ParticipantBarcode vital_status days_to_last_known_alive ethnicity histological_type menopause_status race
0 TCGA-BH-A18M Dead 2207 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma None WHITE
1 TCGA-BH-A18V Dead 1556 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma None WHITE
2 TCGA-Z7-A8R6 Alive 3256 NOT HISPANIC OR LATINO Infiltrating Lobular Carcinoma Pre (<6 months since LMP AND no prior bilatera... WHITE
3 TCGA-3C-AALI Alive 4005 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... BLACK OR AFRICAN AMERICAN
4 TCGA-4H-AAAK Alive 348 NOT HISPANIC OR LATINO Infiltrating Lobular Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... WHITE
5 TCGA-5L-AAT0 Alive 1477 HISPANIC OR LATINO Infiltrating Lobular Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... WHITE
6 TCGA-A1-A0SN Alive 1196 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... WHITE
7 TCGA-A1-A0SJ Alive 416 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma None BLACK OR AFRICAN AMERICAN
8 TCGA-A1-A0SQ Alive 554 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... WHITE
9 TCGA-A1-A0SP Alive 584 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma None None
10 TCGA-A1-A0SH Alive 1437 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... WHITE
11 TCGA-A2-A25E Alive 3204 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... BLACK OR AFRICAN AMERICAN
12 TCGA-A2-A25B Alive 1291 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... BLACK OR AFRICAN AMERICAN
13 TCGA-A2-A0SX Alive 1534 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... WHITE
14 TCGA-A2-A0YL Alive 1474 NOT HISPANIC OR LATINO Infiltrating Lobular Carcinoma Pre (<6 months since LMP AND no prior bilatera... WHITE
15 TCGA-A2-A0D4 Alive 767 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... BLACK OR AFRICAN AMERICAN
16 TCGA-A2-A3XZ Alive 1532 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... BLACK OR AFRICAN AMERICAN
17 TCGA-A2-A0D2 Alive 1027 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... WHITE
18 TCGA-A2-A04U Alive 2654 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Peri (6-12 months since last menstrual period) WHITE
19 TCGA-A2-A04V Dead 1920 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... WHITE
20 TCGA-A2-A0CV Alive 3011 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... ASIAN
21 TCGA-A2-A0T3 Alive 1516 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... WHITE
22 TCGA-A2-A3XU Dead 912 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... BLACK OR AFRICAN AMERICAN
23 TCGA-A2-A3XV Alive 996 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... BLACK OR AFRICAN AMERICAN
24 TCGA-A2-A0T6 Alive 575 NOT HISPANIC OR LATINO Infiltrating Lobular Carcinoma Pre (<6 months since LMP AND no prior bilatera... WHITE
25 TCGA-A2-A0D3 Alive 1873 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... WHITE
26 TCGA-A2-A0EX Alive 752 NOT HISPANIC OR LATINO Infiltrating Lobular Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... WHITE
27 TCGA-A2-A25A Alive 3276 NOT HISPANIC OR LATINO Infiltrating Lobular Carcinoma Pre (<6 months since LMP AND no prior bilatera... WHITE
28 TCGA-A2-A3XT Alive 2770 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... BLACK OR AFRICAN AMERICAN
29 TCGA-A2-A1G0 Alive 616 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... WHITE
... ... ... ... ... ... ... ...
299 TCGA-GM-A2DA Dead 6593 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... WHITE
300 TCGA-GM-A2DL Alive 3519 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Peri (6-12 months since last menstrual period) WHITE
301 TCGA-GM-A3XN Alive 2019 NOT HISPANIC OR LATINO Infiltrating Lobular Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... WHITE
302 TCGA-GM-A3XL Alive 2108 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... BLACK OR AFRICAN AMERICAN
303 TCGA-GM-A3XG Alive 1330 NOT HISPANIC OR LATINO Infiltrating Lobular Carcinoma Pre (<6 months since LMP AND no prior bilatera... WHITE
304 TCGA-HN-A2OB Dead 1900 None Infiltrating Lobular Carcinoma Pre (<6 months since LMP AND no prior bilatera... None
305 TCGA-JL-A3YX Alive 352 NOT HISPANIC OR LATINO Infiltrating Lobular Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... ASIAN
306 TCGA-JL-A3YW Alive 360 NOT HISPANIC OR LATINO Infiltrating Lobular Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... ASIAN
307 TCGA-LD-A66U Alive 646 NOT HISPANIC OR LATINO Infiltrating Lobular Carcinoma None WHITE
308 TCGA-LL-A5YP Alive 450 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... BLACK OR AFRICAN AMERICAN
309 TCGA-LL-A7SZ Alive 594 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... BLACK OR AFRICAN AMERICAN
310 TCGA-LL-A6FR Alive 489 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... BLACK OR AFRICAN AMERICAN
311 TCGA-LL-A5YN Alive 447 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... BLACK OR AFRICAN AMERICAN
312 TCGA-LL-A5YO Alive 440 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... BLACK OR AFRICAN AMERICAN
313 TCGA-MS-A51U Alive 681 NOT HISPANIC OR LATINO Infiltrating Lobular Carcinoma None WHITE
314 TCGA-OL-A5RV Alive 1062 None Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... BLACK OR AFRICAN AMERICAN
315 TCGA-OL-A5RW Alive 1106 None Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... BLACK OR AFRICAN AMERICAN
316 TCGA-OL-A5D8 Alive 973 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... BLACK OR AFRICAN AMERICAN
317 TCGA-OL-A6VQ Alive 600 NOT HISPANIC OR LATINO Infiltrating Lobular Carcinoma Pre (<6 months since LMP AND no prior bilatera... WHITE
318 TCGA-OL-A6VO Alive 858 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma None BLACK OR AFRICAN AMERICAN
319 TCGA-OL-A66I Alive 714 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma None BLACK OR AFRICAN AMERICAN
320 TCGA-OL-A6VR Alive 1220 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma None BLACK OR AFRICAN AMERICAN
321 TCGA-OL-A66O Alive 528 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... BLACK OR AFRICAN AMERICAN
322 TCGA-PE-A5DE Alive 2645 None Infiltrating Lobular Carcinoma Pre (<6 months since LMP AND no prior bilatera... WHITE
323 TCGA-PL-A8LZ Alive 302 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... BLACK OR AFRICAN AMERICAN
324 TCGA-PL-A8LY Alive 8 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... BLACK OR AFRICAN AMERICAN
325 TCGA-PL-A8LX Alive 5 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Pre (<6 months since LMP AND no prior bilatera... BLACK OR AFRICAN AMERICAN
326 TCGA-S3-AA14 Alive 529 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... BLACK OR AFRICAN AMERICAN
327 TCGA-S3-A6ZH Alive 641 NOT HISPANIC OR LATINO Infiltrating Ductal Carcinoma None BLACK OR AFRICAN AMERICAN
328 TCGA-XX-A899 Alive 467 NOT HISPANIC OR LATINO Infiltrating Lobular Carcinoma Post (prior bilateral ovariectomy OR >12 mo si... WHITE

329 rows × 7 columns

and finally we call the main query:


In [22]:
bq.Query(createCohort_and_checkAnnotations).results().to_dataframe()


Out[22]:
ParticipantBarcode
0 TCGA-3C-AALI
1 TCGA-4H-AAAK
2 TCGA-5L-AAT0
3 TCGA-A1-A0SH
4 TCGA-A1-A0SJ
5 TCGA-A1-A0SN
6 TCGA-A1-A0SP
7 TCGA-A1-A0SQ
8 TCGA-A2-A04P
9 TCGA-A2-A04Q
10 TCGA-A2-A04R
11 TCGA-A2-A04U
12 TCGA-A2-A04V
13 TCGA-A2-A04W
14 TCGA-A2-A04X
15 TCGA-A2-A0CL
16 TCGA-A2-A0CM
17 TCGA-A2-A0CV
18 TCGA-A2-A0CZ
19 TCGA-A2-A0D2
20 TCGA-A2-A0D3
21 TCGA-A2-A0D4
22 TCGA-A2-A0EX
23 TCGA-A2-A0SX
24 TCGA-A2-A0T3
25 TCGA-A2-A0T5
26 TCGA-A2-A0T6
27 TCGA-A2-A0YE
28 TCGA-A2-A0YJ
29 TCGA-A2-A0YL
... ...
297 TCGA-GM-A2DL
298 TCGA-GM-A3XG
299 TCGA-GM-A3XL
300 TCGA-GM-A3XN
301 TCGA-HN-A2OB
302 TCGA-JL-A3YW
303 TCGA-JL-A3YX
304 TCGA-LD-A66U
305 TCGA-LL-A5YN
306 TCGA-LL-A5YO
307 TCGA-LL-A5YP
308 TCGA-LL-A6FR
309 TCGA-LL-A7SZ
310 TCGA-MS-A51U
311 TCGA-OL-A5D8
312 TCGA-OL-A5RV
313 TCGA-OL-A5RW
314 TCGA-OL-A66I
315 TCGA-OL-A66O
316 TCGA-OL-A6VO
317 TCGA-OL-A6VQ
318 TCGA-OL-A6VR
319 TCGA-PE-A5DE
320 TCGA-PL-A8LX
321 TCGA-PL-A8LY
322 TCGA-PL-A8LZ
323 TCGA-S3-A6ZH
324 TCGA-S3-AA14
325 TCGA-XX-A899
326 TCGA-Z7-A8R6

327 rows × 1 columns

Note that we didn't need to call each sub-query individually, we could have just called the main query and gotten the same result. As you can see, two patients that met the clinical select criteria (which returned 329 patients) were excluded from the final result (which returned 327 patients).

Before we leave off, here are a few useful tricks for working with BigQuery in Cloud Datalab:

  • if you want to see the raw SQL, you can just build the query and then print it out (this might be useful, for example, in debugging a query -- you can copy and paste the SQL directly into the BigQuery Web UI);
  • if you want to see how much data and which tables are going to be touched by this data, you can use the "dry run" option. (Notice the "cacheHit" flag -- if you have recently done a particular query, you will not be charged to repeat it since it will have been cached.)

In [23]:
q = bq.Query(createCohort_and_checkAnnotations)
q


Out[23]:
SELECT
  c.ParticipantBarcode AS ParticipantBarcode
FROM (
  SELECT
    a.categoryName,
    a.classificationName,
    a.ParticipantBarcode,
    c.ParticipantBarcode,
  FROM ( (SELECT
  ParticipantBarcode,
  annotationCategoryName AS categoryName,
  annotationClassification AS classificationName
FROM
  [isb-cgc:tcga_201607_beta.Annotations]
WHERE
  ( itemTypeName="Patient"
    AND (annotationCategoryName="History of unacceptable prior treatment related to a prior/other malignancy"
      OR annotationClassification="Redaction" ) )
GROUP BY
  ParticipantBarcode,
  categoryName,
  classificationName) ) AS a
  OUTER JOIN EACH 
       ( (SELECT
  ParticipantBarcode,
  vital_status,
  days_to_last_known_alive,
  ethnicity,
  histological_type,
  menopause_status,
  race
FROM
  [isb-cgc:tcga_201607_beta.Clinical_data]
WHERE
  ( Study="BRCA"
    AND age_at_initial_pathologic_diagnosis<=50
    AND gender="FEMALE" )) ) AS c
  ON
    a.ParticipantBarcode = c.ParticipantBarcode
  WHERE
    (a.ParticipantBarcode IS NOT NULL
      OR c.ParticipantBarcode IS NOT NULL)
  ORDER BY
    a.classificationName,
    a.categoryName,
    a.ParticipantBarcode,
    c.ParticipantBarcode )
WHERE
  ( a.categoryName IS NULL
    AND a.classificationName IS NULL
    AND c.ParticipantBarcode IS NOT NULL )
ORDER BY
  c.ParticipantBarcode

In [24]:
q.execute_dry_run()


Out[24]:
{u'cacheHit': True,
 u'referencedTables': [{u'datasetId': u'tcga_201607_beta',
   u'projectId': u'isb-cgc',
   u'tableId': u'Annotations'},
  {u'datasetId': u'tcga_201607_beta',
   u'projectId': u'isb-cgc',
   u'tableId': u'Clinical_data'}],
 u'totalBytesBilled': u'0',
 u'totalBytesProcessed': u'785007'}

In [ ]: